Vector arithmetic #153

Open
wants to merge 12 commits into
base: dev

@danielpcox danielpcox commented Jun 25, 2021

Adds a new Math method to dataframe.DataFrame that computes n-ary arithmetic functions over entire selected columns, storing the result in a new column (or replacing an existing one). It supports int and float64 types. The operator can be specified as a string (e.g., "+", "/") or as a unary, binary, or ternary int or float64 function (e.g., a float64 function from Go's math package). For example:

/*  `df` is a 5x4 DataFrame:

   Strings  Floats   Primes Naturals
0: e        2.718000 1      1
1: Pi       3.142000 3      2
2: Phi      1.618000 5      3
3: Sqrt2    1.414000 7      4
4: Ln2      0.693000 11     5
   <string> <float>  <int>  <int>
*/
df := New(
	series.New([]string{"e", "Pi", "Phi", "Sqrt2", "Ln2"}, series.String, "Strings"),
	series.New([]float64{2.718, 3.142, 1.618, 1.414, 0.693}, series.Float, "Floats"),
	series.New([]int{1, 3, 5, 7, 11}, series.Int, "Primes"),
	series.New([]int{1, 2, 3, 4, 5}, series.Int, "Naturals"),
)

// The new `Math` method takes a result column name, an operator (string or func), and at least one operand column name.
withNewDiffColumn := df.Math("Diff", "-", "Floats", "Primes")

fmt.Println(withNewDiffColumn)

/* New DataFrame now has a column named "Diff" which is
    the result of subtracting Primes from Floats.
	
    Strings  Floats   Primes Naturals Diff
 0: e        2.718000 1      1        1.718000  
 1: Pi       3.142000 3      2        0.142000  
 2: Phi      1.618000 5      3        -3.382000 
 3: Sqrt2    1.414000 7      4        -5.586000 
 4: Ln2      0.693000 11     5        -10.307000
    <string> <float>  <int>  <int>    <float> 
*/

There are more examples in the docs and tests.

This PR also adds a new FindElem method to dataframe.DataFrame, which lets a user pull a particular series.Element out of a DataFrame by specifying a column and value to select a row (assumed to be unique) and another column to read from that row. For example, the following line searches the "Metric" column for the value "envoy_cluster_upstream_rq_active" and returns the series.Element from the matching row's "Value" column:

df.FindElem("Metric", "envoy_cluster_upstream_rq_active", "Value")

@danielpcox danielpcox changed the title Math method Vector arithmetic Jun 25, 2021
This was referenced Jun 26, 2021
Contributor

@chrmang chrmang left a comment

Hi Daniel,

thank you very much for your contribution. The df.Math is a very interesting function, but is it possible for you to do a little rework? I don't want to reduce it to only math.

  • A footprint like Apply2Element(resultcol string, f func(e1, e2 Element) Element, col1, col2 string) DataFrame gives more flexibility and can work even on string Series.

  • Maybe Apply2Float64(resultcol string, f func(e1, e2 float64) Element, col1, col2 string) DataFrame is also handy. Especially in combination with the "math" package.

  • Automatic coercion can cause subtle bugs in user code. Please, don't use it. If needed, this can be done in f.

  • With df.Filter it is already possible to select one or more Rows of a Dataframe.

  • Maybe we should add a df.Head function to select only the first.
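
A minimal sketch of what the suggested Apply2Element footprint might look like, assuming a simplified `Element` interface rather than gota's real `series.Element`. The names `Element`, `strElem`, and `apply2Element` are hypothetical, and the sketch works on slices instead of named DataFrame columns. It shows why a function over elements also covers string columns.

```go
package main

import "fmt"

// Element is a hypothetical, minimal stand-in for series.Element.
type Element interface {
	String() string
}

// strElem is a string-backed Element used only for this example.
type strElem string

func (s strElem) String() string { return string(s) }

// apply2Element applies f pairwise to two equal-length columns.
// No coercion happens here; any conversion is up to f.
func apply2Element(f func(e1, e2 Element) Element, col1, col2 []Element) ([]Element, error) {
	if len(col1) != len(col2) {
		return nil, fmt.Errorf("column lengths differ: %d vs %d", len(col1), len(col2))
	}
	out := make([]Element, len(col1))
	for i := range col1 {
		out[i] = f(col1[i], col2[i])
	}
	return out, nil
}

func main() {
	names := []Element{strElem("Pi"), strElem("Phi")}
	kinds := []Element{strElem("circle"), strElem("golden")}
	joined, err := apply2Element(func(a, b Element) Element {
		return strElem(a.String() + "/" + b.String())
	}, names, kinds)
	if err != nil {
		panic(err)
	}
	for _, e := range joined {
		fmt.Println(e)
	}
}
```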

@@ -12,6 +12,8 @@ This project adheres to [Semantic Versioning](http://semver.org/).
- Combining filters with AND
- User-defined filters
- Concatination of Dataframes
- Math for vector operations on multiple columns

Contributor

Please move it to the 0.12.0 section

Author

Will do 👌

@danielpcox
Author

Thanks for your comments. I agree this can be reworked to accommodate more than math. However, I'd be very sorry to see the math-specific string flavors of op go, or to add more overhead to a simple operation such that it can no longer be performed succinctly in the 90% case.

Counter-proposal:

What if instead we split df.Math into three different new methods:

  1. The first would have signature Arithmetic(resultcol string, op string, operandcols ...string) DataFrame, which only takes the limited string ops ("+", "-", "*", "/", and "%"), and takes variadic operands like df.Math currently does. I'd also like Arithmetic to be allowed to coerce values, because its purpose is to make common operations as easy as possible, but more on that below.
  2. The second would have signature something like ElemMultiApply(resultcol string, op func(elements ...Element) Element, operandcols ...string) DataFrame, where the user passes a variadic function on Elements and however many columns, and it gets applied without coercion.
  3. The third would have signature something like FloatMultiApply(resultcol string, op interface{}, operandcols ...string) DataFrame and use the same techniques as in the current df.Math to support unary, binary, and trinary op functions on at least float64 values (and I'd really like to be able to automatically convert the ints in mixed operand columns to float64 as necessary to enable pleasant access to the math package on integers - but see below for coercion discussion). It would have to be able to support all three arities to be able to take any function from math directly.

No coercion?

Is the request not to do automatic coercion a gota policy set in stone? I thought there was already automatic coercion in gota. Capply and Rapply in the readme say "casting the types as necessary", and the function I used to figure out what the output should be (int or float64) was already there; I just moved it out of a function to make it accessible to Math.

Are you sure you wouldn't want it in, when it's only automatic coercion in one direction (int -> float64) and it only happens when the input columns are mismatched (at least one float64 column among the operands)? I would personally much prefer a concise API with a few well-documented potential gotchas to a verbose API that makes me do extra work in the most common cases. Coercion is also how I managed to make it possible without much ceremony to pass any function from Go's math package in as op and have it correctly apply to columns of mixed type (they get detected and cast to float64 to be compatible, and the output is always float64).

There's also type coercion in Pandas and R, and people seem to be able to handle it. I think a nice pile of warnings in the documentation would suffice, at least for me, and what we get for it is agility and API clarity. (And the reason I'm using gota in the first place is because idiomatic Go doesn't let me express a complex high level thought succinctly enough to do it often.)

All of that said, I'm flexible here, and it's your project. :)

FindElem

As for FindElem, I think I can just remove that without sacrificing much. I currently perform the same operation in my existing code with df.Filter(...).Elem(0,1).Float() which is succinct enough. (If we do add something that gives you only the first match later though, I'd suggest First or FirstRow rather than Head, because df.head in Pandas shows the first n, defaulting to 5.)

As an aside, for when there are many rows, I was thinking of adding Index(columnname string) DataFrame which would build an index of the values in that column to their row number, and if a user chose to build such an index up-front, anything that needed to search for a value (e.g., Filter) would make use of it to improve performance. That's still possible with the Filter-Elem-Float paradigm for looking up a value.
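
For illustration, the index idea could be as simple as a map from each value to the rows where it occurs. This is a sketch under that assumption; `buildIndex` is a hypothetical helper over a plain string column with made-up data, not a proposed gota signature.

```go
package main

import "fmt"

// buildIndex maps each distinct value in a column to the row numbers
// where it appears, so later lookups avoid a full scan.
func buildIndex(col []string) map[string][]int {
	idx := make(map[string][]int)
	for row, v := range col {
		idx[v] = append(idx[v], row)
	}
	return idx
}

func main() {
	// Made-up sample columns for illustration only.
	metric := []string{"rq_active", "rq_total", "rq_active"}
	values := []float64{3, 1200, 5}

	idx := buildIndex(metric) // built once, up front
	for _, row := range idx["rq_active"] {
		fmt.Println(row, values[row])
	}
}
```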

@chrmang
Contributor

chrmang commented Jul 5, 2021

Thank you for your detailed explanation. Let me explain my opinion about coercion.

In many cases it can be a great thing. It is especially useful when dealing with AI, because sooner or later all variables are floats and a loss of one or two digits of precision is not a problem. In other use cases this would not be acceptable; think of financial services. And it can cause bugs like #154. This bug was in gota; other bugs can be in user code. It's all about the use case.
Go is designed as a typed language and we should use the benefits of compile-time type checking. You are right, gota is full of automatic coercion, but this is the preferred way in Python, R and JavaScript. What benefit would people have if we write a 1:1 copy of pandas in Go? Speed? pandas is written in C with a Python wrapper - it's already fast. What else if not type safety?
In the future maybe I will find a way to replace interface{} with generics or generate or ... Nevertheless we should move forward and improve the library's usability.
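
The precision concern can be shown with a small standalone sketch (`roundTrips` is a hypothetical helper for this example). A float64 has a 53-bit significand, so integers above 2^53 do not all survive an int-to-float64 conversion:

```go
package main

import "fmt"

// roundTrips reports whether converting n to float64 and back preserves its value.
func roundTrips(n int64) bool {
	return int64(float64(n)) == n
}

func main() {
	small := int64(1000000)
	big := int64(1<<53) + 1 // 9007199254740993, not representable as float64

	fmt.Println(roundTrips(small)) // small integers convert exactly
	fmt.Println(roundTrips(big))   // the +1 is lost in the conversion
}
```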

The idea of your counter-proposal is good. You know my opinion about coercion, but let's have a try. Go for it.
Can you change your PR, please?

@danielpcox
Author

danielpcox commented Jul 12, 2021

> What benefit would people have if we write a 1:1 copy of pandas in Go? Speed? pandas is written in C with a Python wrapper - it's already fast. What else if not type safety?

Hmm... I admit, in my particular case, the only reason I'm using Go is because I have to. So at least one benefit would be that people don't have to leave their preferred language (which is great for other purposes) to manipulate tabular data in a readable way. The company I work for has services written in Go, and we need a compact way to express quite a large collection of high-level operations on tables. I don't think idiomatic Go is the best language for doing that (partly because of static typing, but mostly because the for loop reigns supreme in Go, and because of inline error checking), but I can't rewrite someone's entire service just because I'd prefer to do the number crunching with Pandas. I was pleased to find gota because it bends a few of the laws that make Go especially painful for high-level data wrangling, and therefore finally made my code readable. It's not Go, it's a DSL, and that's my favorite thing about it.

> The idea of your counter-proposal is good. You know my opinion about coercion, but let's have a try. Go for it.

Do you mean "go for it" as-written, or without any automatic int->float64 coercion? If the latter, perhaps I should also add a fourth method to DataFrame that explicitly coerces a column's types, so there isn't a big drop in readability when the types don't match.
